Abstract

The  dynamic  tree  expression  problem  (DTEP)  was  defined  in  [Ma87].  In  this  paper, 
efficient  implementations  of  the  DTEP  algorithm  are  developed  for  the  hypercube, 
butterfly,  perfect  shuffle  and  multi-dimensional  mesh  of  trees  families  of  networks. 

1  Introduction


The  dynamic  tree  expression  problem  (DTEP)  was  introduced  by  Mayr  [Ma87]  and  is 
based  upon  previous  work  by  Ruzzo  [Ru80],  Miller  &  Reif  [MR85]  and  Ullman  &  Van 
Gelder  [UV85].  This  paper  develops  efficient  implementations  of  the  DTEP  algorithm  for 
the  hypercube,  butterfly,  perfect  shuffle  and  multi-dimensional  mesh  of  trees  families  of 
networks. 

In  Section  2  we  give  the  formal  definition  of  DTEP  and  an  algorithm  for  solving  it, 
which  will  be  referred  to  as  the  DTEP  algorithm.  Section  3  provides  the  details  of  the 
computational  model  of  the  networks  we  are  considering,  along  with  implementations 
for  two  useful  primitive  operations.  In  Section  4  these  primitives  are  used  to  produce 
implementations  of  the  DTEP  algorithm  and  an  analysis  of  their  time  and  communication 
requirements  is  performed.  We  will  be  primarily  concerned  with  implementations  for  single 
instruction  stream,  multiple  data  stream  (SIMD)  parallel  computers,  a  well-known  class 
first defined by Flynn [Fl66].

There  is  a  list  of  symbols  in  the  appendix  which  should  serve  to  clarify  the  programming 
notation. 

2  The  DTEP  Algorithm 

A DTEP instance is a triple $(P, I, Z)$ where

1. $P$ is a set of $n$ Boolean variables $p_0, \ldots, p_{n-1}$,

2. $I$ is a set of inference rules of the form $p_i \leftarrow p_j$ or $p_i \leftarrow p_j p_k$,

3. $Z \subseteq P$.

The task is to compute the minimal model for $(P, I, Z)$, i.e. the minimal $M \subseteq P$ satisfying

1. $Z \subseteq M$,

2. $(p_j \in M) \wedge (p_i \leftarrow p_j \in I) \Rightarrow p_i \in M$, and

3. $(p_j, p_k \in M) \wedge (p_i \leftarrow p_j p_k \in I) \Rightarrow p_i \in M$.

The Boolean variables belonging to $Z$ may be thought of as axioms. The inference rules
are applied to these axioms to prove as many of the remaining variables true as possible.
A derivation tree for $p \in P$ is a labelled binary tree with node labels taken from $P$ such
that: (i) labels of the leaves belong to $Z$; (ii) if an internal node has label $p_i$ and a single
child labelled $p_j$ then $p_i \leftarrow p_j \in I$; (iii) if an internal node has label $p_i$ and children labelled
$p_j$, $p_k$ then $p_i \leftarrow p_j p_k \in I$; (iv) the root is labelled $p$. The size of a derivation tree $T$ is the
number of nodes that it contains and is written $|T|$.

Clearly, $p \in M$ if and only if there is a derivation tree for $p$. The following algorithm
makes use of this fact to construct the minimal model $M$ for a given DTEP instance
$(P, I, Z)$. We will consider a parallel implementation using $n^3$ processors, each identified
by a unique triple $(i, j, k)$, $0 \le i, j, k < n$. There are $n + n^2 + n^3$ variables, which are
initialized as follows

1. $P_i = (p_i \in Z)$, $0 \le i < n$,

2. $P_{ij} = (i = j) \vee (p_i \leftarrow p_j \in I)$, $0 \le i, j < n$, and

3. $P_{ijk} = (p_i \leftarrow p_j p_k \in I) \vee (p_i \leftarrow p_k p_j \in I)$, $0 \le i, j, k < n$.

procedure DTEP
(1)  loop
(2)      $P_i \leftarrow (P_j \wedge P_k \wedge P_{ijk}) \vee (P_j \wedge P_{ij})$
(3)      $P_{ij} \leftarrow P_k \wedge P_{ijk}$
(4)      exit when no new $P_i$ has been derived
(5)      $[P_{ij}] \leftarrow [P_{ij}]^2$
(6)  end loop
end DTEP

Line 5 of DTEP computes the square of the Boolean matrix $[P_{ij}]$. The problem of
implementing general matrix multiplication on the hypercube and perfect shuffle was studied
extensively by Dekel, Nassimi and Sahni [DNS81]. For the special case of Boolean matrix
multiplication, Agerwala and Lint have given a parallel implementation of the four
Russians' algorithm which runs in $O(\log n)$ time using $n^{\ldots}/(\log n \log\log n)$ processors [AL78].

Since the diagonal entries of the matrix $[P_{ij}]$ are always true, any variable which
becomes true at any time during the course of the computation will remain so. Using the
method of Miller & Reif it can be proven that if $p_i \in M$ has a derivation tree $T$ in $(P, I, Z)$
then $P_i$ becomes true within at most $(\log_{4/3} 2) \log |T|$ iterations of the loop [MR85]. The
correctness of the terminating condition used above is easy to establish using a proof by
contradiction.
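
The loop can be exercised sequentially on small instances. The following is a minimal sketch, not the parallel implementation developed in this paper. It assumes that every assignment in lines 2 and 3 only ever adds true values (consistent with the remark that a variable which becomes true remains so), and the function name `dtep` and its argument format are illustrative choices rather than notation from the paper.

```python
def dtep(n, unary, binary, axioms):
    """Return the minimal model as a set of variable indices.

    unary  : iterable of (i, j)    meaning p_i <- p_j
    binary : iterable of (i, j, k) meaning p_i <- p_j p_k
    axioms : iterable of indices belonging to Z
    """
    axioms = set(axioms)
    unary = set(unary)
    P = [i in axioms for i in range(n)]
    Pij = [[i == j or (i, j) in unary for j in range(n)] for i in range(n)]
    # Store each binary rule in both premise orders, as in the initialization of P_ijk.
    Pijk = {(i, a, b) for (i, j, k) in binary for (a, b) in ((j, k), (k, j))}

    while True:
        # Line 2: derive new P_i from the values held at the start of the iteration.
        new_P = list(P)
        for (i, j, k) in Pijk:
            if P[j] and P[k]:
                new_P[i] = True
        for i in range(n):
            if any(P[j] and Pij[i][j] for j in range(n)):
                new_P[i] = True
        # Line 3: a binary rule with one true premise acts as a unary rule.
        # Assumption: the update is accumulated with OR.
        for (i, j, k) in Pijk:
            if new_P[k]:
                Pij[i][j] = True
        # Line 4: stop once an iteration derives nothing new.
        if new_P == P:
            return {i for i in range(n) if P[i]}
        P = new_P
        # Line 5: Boolean squaring of the matrix [P_ij].
        Pij = [[any(Pij[i][k] and Pij[k][j] for k in range(n)) for j in range(n)]
               for i in range(n)]
```

For example, with rules $p_1 \leftarrow p_0$, $p_2 \leftarrow p_1$, $p_3 \leftarrow p_2 p_1$, $p_4 \leftarrow p_5$ and $Z = \{p_0\}$, the call `dtep(6, {(1, 0), (2, 1), (4, 5)}, {(3, 2, 1)}, {0})` yields the indices of $p_0, \ldots, p_3$.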

The parallel running time for a single iteration depends upon the model of computation.
On a CRCW PRAM, each iteration can be performed in $O(1)$ time even when concurrent
writes must agree; the trick is to write only true values. On a CREW PRAM, each
iteration can be implemented to run in $O(\log n)$ time by using a tree computation whenever
it is necessary to OR together many Boolean values. The EREW PRAM remains powerful
enough to achieve $O(\log n)$ but the constant of proportionality is higher than for CREW
since there are instances where a tree computation must be used to make copies of a value
needed by many processors.
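
The tree computation mentioned above can be illustrated by a small sequential model. This is only a sketch: the function name `tree_or` is hypothetical, and each pass of the `while` loop stands for one parallel round in which every pair is combined by a different processor.

```python
def tree_or(values):
    """OR together a list of Booleans by pairwise combination.

    Returns (result, rounds), where rounds is the number of parallel rounds
    a tree computation would need.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # In one round each pair is combined by its own processor, so no
        # memory cell is read or written by more than one processor.
        nxt = [level[i] or level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired value is carried to the next round
        level = nxt
        rounds += 1
    return (level[0] if level else False), rounds
```

The number of rounds grows as $\lceil \log_2 n \rceil$ for $n$ input values, which is the source of the $O(\log n)$ bound per iteration.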

Thus, DTEP is guaranteed to run in $\mathcal{NC}$ whenever each $p_i \in M$ has a derivation tree
of “expolylog” size, that is, bounded by $2^{\log^c n}$ for some constant $c > 0$. An important
consequence of this is that any problem which can be transformed, within $\mathcal{NC}$, to a
derivation system with expolylog bounded derivation trees is itself in $\mathcal{NC}$. The planar monotone
circuit value problem [Go80], known to be in $\mathcal{NC}$, is an example of a problem which admits
such a transformation.

3  Network  Primitives:  Replicate  and  Collect

Our first goal is to develop, in a systematic manner, “efficient” implementations of the
DTEP algorithm for several well-known networks. Assume without loss of generality that
the given DTEP instance $(P, I, Z)$ has $|P|$ equal to a power of two and define $n$, $m$, $N$ and
$M$ by the equations

$n = |P|, \quad m = \log n, \quad N = n^3, \quad M = \log N.$

We adopt the usual abstract view of a network as a graph in which nodes represent
processors and edges represent bidirectional communication links. Each processor has a local
memory with words of length $O(\log n)$ and we make the uniform cost model assumption
that the standard set of ALU operations can be performed in constant time on operands of
this size. Every processor is assigned a unique $O(\log n)$ bit number which will be referred
to as its id.

We will initially restrict our attention to SIMD parallel computers. One way to
understand the SIMD model is to imagine many processors synchronously executing duplicate
copies of a program with no conditional branch instructions and in which every statement
$S_i$ is accompanied by a Boolean condition $C_i$. The statements of the program operate on
local variables and data received through messages from adjacent processors. Each
processor has the same set of local variables as any other, but they may have different initial
values. There is no global memory. We will not be concerned with the question of how
the network communicates with the outside world; the input (output) is simply given by
the initial (final) values of some subset of the variables.

Program execution takes place in the following manner. When all of the processors
arrive simultaneously (as must be the case) at some statement $S_i$, they first evaluate $C_i$.
Those for which $C_i$ is true are said to be enabled and proceed to execute $S_i$. The remaining
processors are disabled for the period of time that it takes to execute $S_i$. This process is
then repeated at the statement following $S_i$. In our programs, the condition $C_i$ will be
a function of the processor id which can be computed in constant time. For example, if
the processor id is $z$ and $C_i$ is given by the expression $z_5 = 1 \wedge z_{[0,3)} = z_{[6,9)}$, it could be
evaluated in constant time by the “machine” expression

($z$ AND $m_1$) $\ne 0$ $\wedge$ ($z$ AND $m_2$) = (($z$ DIV $m_3$) AND $m_2$)

where $m_1 = 100000_2$, $m_2 = 111_2$ and $m_3 = 1000000_2$ are masks obtained in constant time
by indexing into tables which can be precomputed in $O(\log n)$ time.^1
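
The mask expression can be checked against a direct bit-by-bit reading of the condition. The sketch below uses the three mask values given in the text; the function names `enabled` and `enabled_bitwise` are illustrative only.

```python
M1 = 0b100000    # selects bit 5 of the id
M2 = 0b111       # selects bits [0, 3)
M3 = 0b1000000   # integer division by 2**6 moves bits [6, 9) down to [0, 3)

def enabled(z):
    """The "machine" expression: (z AND m1) != 0 and (z AND m2) = ((z DIV m3) AND m2)."""
    return (z & M1) != 0 and (z & M2) == ((z // M3) & M2)

def enabled_bitwise(z):
    """The same condition, z_5 = 1 and z_[0,3) = z_[6,9), read off bit by bit."""
    bits = [(z >> b) & 1 for b in range(9)]
    return bits[5] == 1 and bits[0:3] == bits[6:9]
```

Both functions agree on every 9-bit id, for instance on $z = 101100101_2$, whose bit 5 is set and whose low and high three-bit fields are both $101_2$.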

Algorithmic complexity will be measured in terms of communication overhead as well
as time. We will consider the execution of a program to consist of a sequence of steps. Each
step is allowed only $O(1)$ time and is made up of a computation phase and a communication
phase. During the computation phase no messages are passed between processors. During
the communication phase each processor can send (and/or receive) an $O(\log n)$ length
message to (from) each of its neighbors. Define the communication cost of an algorithm
to be the total number of messages which it uses. We will sometimes refer to the total
number of steps used by an algorithm as its step count. In this paper, “exact” step counts
should be interpreted as being accurate to within an additive constant, e.g. $5 \log n$ means
$5 \log n + O(1)$. Note that a step count of $f(n)$ implies a running time which is $O(f(n))$.

^1 Table lookup is not necessary if we are given an instruction capable of shifting a register $O(\log n)$ bit
positions in constant time (e.g. MUL); under this assumption the masks can be computed on the fly.

For  each  network  family  we  will  describe  two  implementations  of  the  DTEP  inner  loop 
which  attempt  to  minimize:  (i)  step  count;  (ii)  communication  cost  under  the  constraint 
that the step count lie within a constant factor of optimal, i.e. it must be $O(\log n)$. For a
synchronous, fixed interconnection network there is little motivation for minimizing
communication cost since a communication phase will use the same amount of time regardless
of  how  many  links  are  actually  used  to  send  messages.  However,  the  amount  of  message 
traffic  may  be  of  critical  importance  in  a  time-shared  asynchronous  environment  or  when 
the  network  for  which  the  algorithm  has  been  designed  is  being  simulated  on  another 
type  of  network.  We  will  also  indicate  how  much  improvement  in  the  running  time  can 
be  obtained  by  modifying  our  implementations  slightly  to  take  advantage  of  a  multiple 
instruction  stream,  multiple  data  stream  (MIMD)  environment. 

The  motivation  for  counting  steps  is  to  allow  the  constant  multiplicative  factors  on 
the  leading  term  of  the  running  time  of  two  programs  to  be  compared  with  reasonable 
accuracy  without  resorting  to  the  tedious  approach  of  counting  up  CPU  cycles.  If  it  is  true 
that  the  running  times  of  individual  steps  tend  to  be  clustered  around  a  single  value  then 
this  approximation  will  be  a  useful  one.  Unfortunately,  our  definition  of  a  step  allows  k 
independent  calculations  to  be  “interleaved”  in  such  a  way  that  the  step  count  goes  down 
by  a  factor  of  k  while  the  actual  running  time  remains  more  or  less  unchanged.  This  is 
accomplished  by  passing  all  local  data  which  is  relevant  to  any  of  the  k  calculations  to 
all  neighbors  which  require  any  data  and  merging  the  computation  phases.  In  order  to 
preserve  the  desired  correlation  between  running  time  and  step  count,  we  do  not  allow 
interleaving  in  our  minimum  step  count  implementations. 

The observations made in Section 2 regarding EREW PRAMs indicate that any
network which admits an efficient implementation of the DTEP algorithm must be able to
rapidly: (i) distribute copies of a particular value to many processors, and (ii) OR together
many values stored at different processors. This motivates the definition of two primitive
operations which we refer to as Replicate and Collect. The Replicate primitive takes four
arguments: a pointer $p$, non-negative integers start and width satisfying $start + width \le M$
and an integer select which should be thought of as a width-bit mask. The effect of the
operation may be written

$*p \text{ at } z \;\Leftarrow\; *p \text{ at } z_{[start+width,\,M)} \circ select \circ z_{[0,\,start)}$

where $z$ denotes the processor id. For example, if $p$ points to some variable $x$, $M = 6$,
start = 3, width = 2 and select = $01_2$ then $x$ at processor $(z_5 {*} {*} z_2 z_1 z_0)_2$ would be assigned
the value of $x$ at processor $(z_5 0 1 z_2 z_1 z_0)_2$. An important observation to make at this point
is that by passing a field of the processor id to select rather than a constant, it is possible
to perform transposition. There are several examples of this usage of Replicate in Section 4.
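The Collect primitive requires only the first three of the above parameters and performs
the operation

$$*p \text{ at } z \;\Leftarrow\; \begin{cases} \bigvee_{0 \le t < 2^{width}} *p \text{ at } z_{[start+width,\,M)} \circ t_{[0,\,width)} \circ z_{[0,\,start)} & \text{if } z_{[start,\,start+width)} = 0, \\ \text{undefined} & \text{otherwise.} \end{cases}$$

Usually a call of the form Collect(p, start, width) will be followed by Replicate(p, start,
width, 0) to obtain the combined effect

$$*p \text{ at } z \;\Leftarrow\; \bigvee_{0 \le t < 2^{width}} *p \text{ at } z_{[start+width,\,M)} \circ t_{[0,\,width)} \circ z_{[0,\,start)}. \qquad (1)$$

Network          | Processors              | Degree  | Diameter   | High Flux | Layout Area
-----------------+-------------------------+---------+------------+-----------+------------------------------
hypercube        | $N = n^3 = 2^{3m}$      | $\log N$ | $\log N$   | yes       | $\Theta(N^2)$
butterfly        | $N \log N$              | 4       | $\log N$   | yes       | $\Theta(N^2)$
perfect shuffle  | $N$                     | 3       | $2 \log N$ | yes       | $\Theta(N^2/\log^2 N)$
kD mesh of trees | $(k+1)N - kN^{(k-1)/k}$ | 2, 3, k | $2 \log N$ | no        | $\Theta(N^{2-2/k})$, $k > 2$

Table 1: Some important network properties.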
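
The intended effect of the two primitives can be modelled on an ordinary list indexed by processor id. This is a reference model of the definitions only, not a network implementation; the helper `field` and the use of `None` for "undefined" are assumptions made for the sketch.

```python
def field(z, lo, hi):
    """Bits [lo, hi) of z as an integer."""
    return (z >> lo) & ((1 << (hi - lo)) - 1)

def replicate(x, M, start, width, select):
    """x[z] <- x[z_[start+width,M) o select o z_[0,start)] for every id z.

    select is an integer, or a function of z when a field of the
    processor id is passed instead of a constant.
    """
    out = []
    for z in range(1 << M):
        s = select(z) if callable(select) else select
        src = (field(z, start + width, M) << (start + width)) | (s << start) | field(z, 0, start)
        out.append(x[src])
    return out

def collect(x, M, start, width):
    """OR over the field [start, start+width), left where that field is zero."""
    out = [None] * (1 << M)   # None models "undefined"
    for z in range(1 << M):
        if field(z, start, start + width) == 0:
            out[z] = any(x[z | (t << start)] for t in range(1 << width))
    return out
```

With $M = 6$, start = 3, width = 2 and select = $01_2$, `replicate` assigns to processor $100101_2$ the value held at processor $101101_2$, matching the example in the text. Calling `collect` and then `replicate` with select = 0 gives the combined effect of equation 1.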

We  now  consider  the  problem  of  implementing  these  two  primitive  operations  on  the 
hypercube,  butterfly,  perfect  shuffle  and  multi-dimensional  mesh  of  trees  networks. 


3.1  Hypercube 

A degree $d$ hypercube has $2^d$ processors with ids ranging from 0 to $2^d - 1$. Processor $i$
is adjacent to processor $j$ if and only if the binary representations of $i$ and $j$ differ in
a single bit position. Some important properties of the hypercube as well as the other
networks which we will study are given in Table 1. The hypercube has high flux^2 but
unbounded degree. Our programs contain if statements but could easily be cast into the
format described in the previous section.

The hypercube implementations of Replicate and Collect are given below. Both use width
steps. Note that Replicate would perform exactly the same function with the condition in
line 2 simplified to $z_{start+i} = select_i$, but this would increase the communication cost from
$O(N)$ to $O(N \log N)$. The exact communication cost of both Replicate and Collect is
$N - N/2^{width}$ messages, which is approximately $N$ for any non-trivial value of width. With
regard to Collect, it is possible to achieve the effect of equation 1 directly by removing
lines 7 and 9. This saves width time by eliminating the need for a call to Replicate, but
increases the communication cost to $O(N \log N)$. We will take advantage of this trade-off
in Section 4.1 in order to minimize the step count of our DTEP implementation.

It is interesting to note that, unlike the other networks listed in Table 1, the hypercube
could handle replicating over a non-contiguous set of address bits just as easily; however,
this feature is not needed for implementing DTEP.

^2 Roughly speaking, a network has high flux if it can sort rapidly. For a formal definition, see [GMU87].

procedure Replicate(p, start, width, select)
(1)  for $i \leftarrow 0$ to $width - 1$
(2)      if $z_{[start+i,\,start+width)} = select_{[i,\,width)}$ then
(3)          $*p^{(start+i)} \Leftarrow *p$
(4)      end if
(5)  end for
end Replicate

procedure Collect(p, start, width)
(6)  for $i \leftarrow 0$ to $width - 1$
(7)      if $z_{[start,\,start+i]} = 0$ then
(8)          $*p \Leftarrow *p \vee *p^{(start+i)}$
(9)      end if
(10) end for
end Collect
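
A step-by-step simulation makes it possible to check both the result and the message count of the two hypercube routines. The sketch below is a sequential model under stated assumptions: the superscript $(start+i)$ is taken to denote the neighbor across bit $start+i$, each enabled processor is counted as exchanging exactly one message per step, and the function names are illustrative.

```python
def field(z, lo, hi):
    """Bits [lo, hi) of z as an integer."""
    return (z >> lo) & ((1 << (hi - lo)) - 1)

def hypercube_replicate(x, M, start, width, select):
    """Simulate Replicate; returns (values, messages)."""
    x, messages = list(x), 0
    for i in range(width):
        bit = start + i
        new = list(x)
        for z in range(1 << M):
            # Enabled processors already hold the value to be copied:
            # their bits [start+i, start+width) equal select_[i, width).
            if field(z, bit, start + width) == (select >> i):
                new[z ^ (1 << bit)] = x[z]   # send across dimension start+i
                messages += 1
        x = new
    return x, messages

def hypercube_collect(x, M, start, width):
    """Simulate Collect; the result is meaningful where the field is zero."""
    x, messages = list(x), 0
    for i in range(width):
        bit = start + i
        new = list(x)
        for z in range(1 << M):
            if field(z, start, bit + 1) == 0:
                new[z] = x[z] or x[z ^ (1 << bit)]   # OR in the neighbor's value
                messages += 1
        x = new
    return x, messages
```

For $M = 6$, start = 3 and width = 2, both routines use $64 - 64/4 = 48$ messages in this model, in agreement with the count $N - N/2^{width}$.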

3.2  Butterfly 

The “standard” butterfly network of degree $d$ has $(d+1)2^d$ processors arranged in $d+1$ rows
of $2^d$. Each of these rows is called a rank, and the ranks are numbered consecutively from
0 to $d$. The processor adjacencies are defined as follows: processor $z$ of rank $r$ is connected
to processors $z$ and $z \oplus 2^r$ of rank $r+1$ for all $r$, $z$ such that $r \in [0, d)$ and $z \in [0, 2^d)$.^3 This
means that processors in ranks 0 and $d$ have degree two while the remaining processors
have degree four. There is an obvious variation of the standard butterfly network in which
the $d$th rank is eliminated by mapping its processors onto those of rank 0. We will adopt
this variation as our definition of a butterfly network, which explains the butterfly entries
in Table 1 for number of processors and node degree.
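
The adjacency rule for the variation with rank $d$ mapped onto rank 0 can be written out directly. The following sketch builds the graph for a small degree and is only an illustration of the definition above; the function name `wrapped_butterfly` is hypothetical.

```python
def wrapped_butterfly(d):
    """Adjacency sets for the butterfly with d ranks of 2**d processors,
    obtained by mapping the processors of rank d onto those of rank 0."""
    adj = {(r, z): set() for r in range(d) for z in range(1 << d)}
    for r in range(d):
        nxt = (r + 1) % d          # rank d coincides with rank 0
        for z in range(1 << d):
            for w in (z, z ^ (1 << r)):   # straight edge and cross edge
                adj[(r, z)].add((nxt, w))
                adj[(nxt, w)].add((r, z))
    return adj
```

For $d = 3$ this gives $3 \cdot 2^3 = 24$ processors, each of degree four.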

It should be apparent that the hypercube is nothing more than a butterfly in which
all of the ranks have been identified; alternatively, the butterfly may be viewed as an
expanded version of the hypercube. As such, the butterfly can perform replication and
collection just as fast as the hypercube as long as: (i) the address bits in question form
a contiguous interval, and (ii) the data is initially located in a rank corresponding to one
of the two endpoints of this interval. The first condition is always satisfied for us since
Replicate and Collect have been defined to operate over the interval [start, start + width).
If the second condition is not satisfied then the butterfly loses ground to the hypercube
because it must perform an Adjust to copy the data to one of the two appropriate ranks.
The rank chosen will depend upon which rank(s) currently hold the data and where the
results will be needed by subsequent calculations.

The Replicate and Collect procedures written below assume that the data resides in
rank 0 and also put the result in rank 0. This implementation is sound but obviously
wasteful; in Section 4.2 we will see that it is possible to do without most of the calls to
Adjust which are implied by a naive translation of the hypercube implementation of the
DTEP algorithm. The complexity of Adjust is shift steps and $N \cdot shift$ messages, where we
refer to the value of shift after line 13 has been executed. The communication cost could
be decreased in those cases where not all values need to be preserved (e.g. preceding a call
to Replicate). This potential optimization has been omitted since we were unable to take
advantage of it in our DTEP implementation. The routines Replicate', Replicate", Collect'
and Collect" all execute in width steps using $O(N)$ messages.

^3 Note that our convention for numbering the ranks is the opposite of that chosen in [Ul84].

procedure Replicate(p, start, width, select)
(1)  Adjust(p, 0, start)
(2)  Replicate'(p, start, width, select)
(3)  Adjust(p, start + width, M - start - width)
   or
(4)  Adjust(p, 0, start + width)
(5)  Replicate"(p, start + width, width, select)
(6)  Adjust(p, start, M - start)
end Replicate

procedure Collect(p, start, width)
(7)  Adjust(p, 0, start)
(8)  Collect'(p, start, width)
(9)  Adjust(p, start + width, M - start - width)
   or
(10) Adjust(p, 0, start + width)
(11) Collect"(p, start + width, width)
(12) Adjust(p, start, M - start)
end Collect

procedure Adjust(p, r, shift)
(13) $\sigma$, shift $\leftarrow$ (M > 2 shift) ? +1, shift : -1, M - shift
(14) for $i \leftarrow 0$ to $shift - 1$
(15)     $r + \sigma i$:  $*p^{\sigma} \Leftarrow *p$
(16) end for
end Adjust

procedure Replicate'(p, start, width, select)
(17) for $r \leftarrow start$ to $start + width - 1$
(18)     if $z_{[r,\,start+width)} = select_{[r-start,\,width)}$ then
(19)         $r$:  $*p^{+1}, *p'^{+1} \Leftarrow *p$
(20)     end if
(21) end for
end Replicate'

procedure Replicate"(p, start, width, select)
(22) for $r \leftarrow start + width - 1$ downto $start$
(23)     if $z_{[start,\,r]} = select_{[0,\,r-start]}$ then
(24)         $r$:  $*p, *p' \Leftarrow *p^{+1}$
(25)     end if
(26) end for
end Replicate"

procedure Collect'(p, start, width)
(27) for $r \leftarrow start$ to $start + width - 1$
(28)     if $z_{[start,\,r]} = 0$ then
(29)         $r$:  $*p^{+1} \Leftarrow *p \vee *p'$
(30)     end if
(31) end for
end Collect'

procedure Collect"(p, start, width)
(32) for $r \leftarrow start + width - 1$ downto $start$
(33)     if $z_{[r,\,start+width)} = 0$ then
(34)         $r$:  $*p \Leftarrow *p^{+1} \vee *p'^{+1}$
(35)     end if
(36) end for
end Collect"

3.3  Perfect  Shuffle 

Like the butterfly, the perfect shuffle is a high flux network with bounded degree. It was
first introduced by Stone [St71]. A base $b$, degree $d$ perfect shuffle has $b^d$ processors with
ids $[0, b^d)$. Each processor is linked to three others via the exchange, shuffle and unshuffle
connections, which allow processor $i$ to communicate with processor
$(i \bmod b = b - 1)\ ?\ i - b + 1 : i + 1$, with the left cyclic shift of $i$ and with the right cyclic
shift of $i$ (taking $i$ as a $d$-digit base-$b$ number), respectively. From this point on we will be
concerned only with the case $b = 2$, so processor $i$ communicates via the exchange connection
with processor $i \oplus 1$. One may view the perfect shuffle as a stripped-down version of the
hypercube with only those edges corresponding to bit 0 adjacencies remaining (the exchange
connection) and augmented by some connections (shuffle, unshuffle) which have the effect of
cycling the mapping of variables to processors in such a way that a bit $i$ dependency can be
transformed into a bit 0 dependency.
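
The way a bit $i$ dependency is turned into a bit 0 dependency can be sketched for $b = 2$ as follows. The code tracks processor ids only, assumes that the shuffle connection is the cyclic left shift of the $d$-bit id (the usual convention), and uses illustrative function names.

```python
def shuffle(z, d):
    """Cyclic left shift of the d-bit id z."""
    return ((z << 1) | (z >> (d - 1))) & ((1 << d) - 1)

def unshuffle(z, d):
    """Cyclic right shift of the d-bit id z (inverse of shuffle)."""
    return (z >> 1) | ((z & 1) << (d - 1))

def exchange(z):
    """Flip bit 0 of the id."""
    return z ^ 1

def flip_bit(z, i, d):
    """Reach the bit-i hypercube neighbor of z using only perfect shuffle links."""
    for _ in range(i):
        z = unshuffle(z, d)   # after i unshuffles, original bit i sits at bit 0
    z = exchange(z)
    for _ in range(i):
        z = shuffle(z, d)     # undo the cycling
    return z
```

A hypercube step across bit $i$ therefore costs $2i + 1$ link traversals when taken in isolation, which is why the routines below cycle a whole range of bits through the low order position in a single pass.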

In order to perform a Replicate or Collect the appropriate range of bits has to be cycled
through the low order position so that exchange operations can be used. The complexity
of Cycle is shift steps and $N \cdot shift$ messages, where we refer to the value of shift after line
13 has been executed. Like the butterfly Adjust procedure, the communication cost of
Cycle could be decreased in certain cases. The routines Replicate', Replicate", Collect' and
Collect" all execute in $2 \cdot width$ steps using $O(N)$ messages.

procedure Replicate(p, start, width, select)
(1)  Cycle(p, M - start - width)
(2)  Replicate'(p, width, select)
(3)  Cycle(p, start)
   or
(4)  Cycle(p, M - start)
(5)  Replicate"(p, width, select)
(6)  Cycle(p, start + width)
end Replicate

procedure Collect(p, start, width)
(7)  Cycle(p, M - start - width)
(8)  Collect'(p, width)
(9)  Cycle(p, start)
   or
(10) Cycle(p, M - start)
(11) Collect"(p, width)
(12) Cycle(p, start + width)
end Collect

procedure Cycle(p, shift)
(13) $\sigma$, shift $\leftarrow$ (M > 2 shift) ? +1, shift : -1, M - shift
(14) for $i \leftarrow 0$ to $shift - 1$
(15)     $*p^{shuffle^{\sigma}} \Leftarrow *p$
(16) end for
end Cycle


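procedure Replicate'(p, width, select)
(17) for $i \leftarrow width - 1$ downto 0
(18)     if $z_0 \circ z_{[M-i,\,M)} = select_{[0,\,i]}$ then
(19)         $*p \Leftarrow *p^{shuffle^{-1}}$
(20)         $*p^{exchange} \Leftarrow *p$
(21)     end if
(22) end for
end Replicate'

procedure Replicate"(p, width, select)
(23) for $i \leftarrow 0$ to $width - 1$
(24)     if $z_{(0,\,width-i)} = select_{(i,\,width)}$ then
(25)         if $z_0 = select_i$ then
(26)             $*p^{exchange} \Leftarrow *p$
(27)         end if
(28)         $*p^{shuffle^{-1}} \Leftarrow *p$
(29)     end if
(30) end for
end Replicate"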

procedure Collect'(p, width)
(31) for $i \leftarrow 0$ to $width - 1$
(32)     if $z_{[0,\,i)} = 0$ then
(33)         $*p^{shuffle} \Leftarrow *p$
(34)         if $z_0 = 0$ then
(35)             $*p \Leftarrow *p \vee *p^{exchange}$
(36)         end if
(37)     end if
(38) end for
end Collect'

procedure Collect"(p, width)
(39) for $i \leftarrow 0$ to $width - 1$
(40)     if $z_0 \circ z_{[M-i,\,M)} = 0$ then
(41)         $*p \Leftarrow *p \vee *p^{exchange}$
(42)         $*p^{shuffle^{-1}} \Leftarrow *p$
(43)     end if
(44) end for
end Collect"

3.4  Multi-Dimensional  Mesh  of  Trees 

The $k$-dimensional mesh of trees of side $n$, where $n$ is a power of two, may be constructed
in the following manner:

1. First assign a unique $k$-tuple of integers from $[0, n)$ to each of $n^k$ processors. We think
of these as being arranged at the corresponding points in $k$-space. These processors
will be referred to as leaf processors.

2. For each dimension $d$, $d \in [0, k)$,

(a) Partition the leaf processors into $n^{k-1}$ sets of $n$ such that the $k$-tuples of the
processors within a set differ only in the $d$th component.

(b) For each such set of $n$ processors

i. Arrange the set in increasing order of the $d$th component.

ii. Connect the set together by forming a binary tree of height $\log n$ using $n - 1$
new processors to form the internal nodes of the tree.

Thus, the $k$-dimensional mesh of trees contains $kn^{k-1}$ trees and a total of

$n^k + kn^{k-1}(n - 1) = (k + 1)n^k - kn^{k-1}$

processors. An interesting aside is that a $k$-dimensional mesh of trees of side two is the
same as a degree $k$ hypercube with every edge replaced by a path of length two.
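
The processor count can be confirmed by carrying out the construction literally for small parameters. The sketch below only counts nodes and trees and does not build the tree edges; the function name `mesh_of_trees_counts` is illustrative.

```python
from itertools import product

def mesh_of_trees_counts(n, k):
    """Return (leaves, trees, total processors) for the k-dimensional
    mesh of trees of side n, following the construction step by step."""
    leaves = list(product(range(n), repeat=k))
    trees = 0
    internal = 0
    for d in range(k):
        # Leaves whose k-tuples differ only in component d share one tree,
        # so each distinct tuple with component d removed names one set.
        groups = {t[:d] + t[d + 1:] for t in leaves}
        trees += len(groups)
        internal += len(groups) * (n - 1)   # a binary tree on n leaves has n - 1 internal nodes
    return len(leaves), trees, len(leaves) + internal
```

For $n = 4$ and $k = 3$ this gives 64 leaf processors, 48 trees and $4 \cdot 64 - 3 \cdot 16 = 208$ processors in total.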




As indicated in Table 1, the mesh of trees is not a high flux network. However, it
is powerful enough to achieve an $O(\log n)$ time implementation of the DTEP inner loop
because it is (not surprisingly) good at performing tree computations. Our DTEP
implementation uses a three-dimensional mesh of trees, but the routines given below are valid for
the general case. Note that width and start must be multiples of $m$. The step complexity
of both Replicate and Collect is $2 \cdot width$ since information needs to be passed up and down
the trees. The communication cost of Replicate is dominated by the last iteration and is
$N + O(n^{\ldots} \log n)$. The message complexity of Collect is dominated by the first iteration and
yields the same result.

When  a  call  to  Collect  spans  more  than  one  dimension,  it  is  possible  to  achieve  the 
result  of  equation  1  more  rapidly  by  employing  a  larger  number  of  messages  in  the  obvious 
fashion.  There  is  an  example  of  this  in  Section  4.4. 

procedure Replicate(p, start, width, select)
(1)  assert (start mod m = 0) $\wedge$ (width mod m = 0)
(2)  for $i \leftarrow width$ downto $m$ step $m$
(3)      PassUp(p, (start + i)/m, start, select)
(4)      Replicate'(p, (start + i)/m, start, select)
(5)  end for
end Replicate

procedure Collect(p, start, width)
(6)  assert (start mod m = 0) $\wedge$ (width mod m = 0)
(7)  for $i \leftarrow 0$ to $width - m$ step $m$
(8)      Collect'(p, (start + i)/m, start)
(9)      PassDown(p, (start + i)/m, start)
(10) end for
end Collect

procedure Replicate'(p, d, start, select)
(11) for $h \leftarrow 0$ to $m - 1$
(12)     if $z_{[start,\,md)} = select_{[0,\,md-start)}$ then
(13)         $d, h$:  $*p^{leftchild}, *p^{rightchild} \Leftarrow *p$
(14)     end if
(15) end for
end Replicate'

procedure Collect'(p, d, start)
(16) for $h \leftarrow m$ downto 1
(17)     if $z_{[start,\,md)} = 0$ then
(18)         $d, h$:  $*p \Leftarrow *p^{leftchild} \vee *p^{rightchild}$
(19)     end if
(20) end for
end Collect'

procedure PassUp(p, d, start, select)
(21) for $h \leftarrow m$ downto 1
(22)     if $z_{[start,\,md+h)} = select_{[0,\,md+h-start)}$ then
(23)         $d, h$:  $*p^{parent} \Leftarrow *p$
(24)     end if
(25) end for
end PassUp

procedure PassDown(p, d, start)
(26) for $h \leftarrow 1$ to $m$
(27)     if $z_{[start,\,md+h)} = 0$ then
(28)         $d, h$:  $*p \Leftarrow *p^{parent}$
(29)     end if
(30) end for
end PassDown

4  Network  Implementations  of  DTEP 

In this section we will present several implementations of the DTEP algorithm. In every
case each processor maintains a set of nine local variables: $P_i$, $P_j$, $P_k$, $P_{ij}$, $P_{ik}$, $P_{kj}$, $P_{ijk}$,
previous, change. The subscripts which appear on the first seven variables do not denote
indexing in the usual sense; they are intended to indicate what value the variable is
expected to contain at a particular processor. Every processor has an $M$ bit $z$ field in its id
which can be split into three $m$ bit fields corresponding to $i$, $j$ and $k$. Formally, we have

$z_{[2m,\,M)} = i, \quad z_{[m,\,2m)} = j, \quad z_{[0,\,m)} = k$

or equivalently, $z = i_{[0,m)} \circ j_{[0,m)} \circ k_{[0,m)}$.^4 It will be convenient to refer to a processor with
$z = i \circ j \circ k$ as processor $(i, j, k)$. This notation is unambiguous for the hypercube, a single
rank of the butterfly, the perfect shuffle and the leaf processors of the three-dimensional
mesh of trees since there is exactly one processor corresponding to each possible triple.
For example, at processor $(*, j, k)$ the variable $P_{kj}$ will “normally” contain the value of the
element $a_{kj}$ in the $k$th row and $j$th column of the $n \times n$ direct implication matrix maintained
by the DTEP algorithm. Although not explicitly subscripted, change and previous depend
on $i$ alone.
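
The correspondence between processor ids and triples can be made concrete with a few helpers. These are illustrative only; the names `compose`, `split` and `processors_matching` do not appear in the paper, and the string `'*'` is used here to stand for a field on which a value does not depend.

```python
def compose(i, j, k, m):
    """The id z = i o j o k built from three m-bit fields."""
    return (i << (2 * m)) | (j << m) | k

def split(z, m):
    """Recover (i, j, k) = (z_[2m,M), z_[m,2m), z_[0,m)) from an id z."""
    mask = (1 << m) - 1
    return (z >> (2 * m)) & mask, (z >> m) & mask, z & mask

def processors_matching(pattern, m):
    """Ids of all processors matching a triple such as ('*', 2, '*')."""
    return [z for z in range(1 << (3 * m))
            if all(p == '*' or p == f for p, f in zip(pattern, split(z, m)))]
```

With $m = 2$, processor $(1, 2, 3)$ has id $011011_2$, and the triple $(*, 2, *)$ matches the $n^2 = 16$ processors whose $j$ field equals 2.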

In  order  to  assist  the  reader  in  following  our  programs,  every  line  which  affects  the 
values  of  one  or  more  local  variables  is  labelled  with  a  corresponding  number  of  triples  in 
the  right  margin.  The  triple  indicates  how  the  values  of  a  particular  variable  are  distributed 
amongst  the  processors.  For  instance,  line  5  of  Section  4.1  assigns  a  value  to  Pj  and  is 
labelled  with  {*,j,*).  This  means  that  all  processors  with  the  same  j  field,  Z[,n,2m)-,  also 
have  the  same  value  for  Pj,  ie.  Pj  does  not  currently  depend  on  the  i  or  k  fields.  As 

^Sometimes  we  will  write  such  an  equation  as  simply  ;  =  i  o  j  o  k  when  the  intended  “width”  of  the 
integers  on  the  right  hand  side  of  the  equation  is  clear;  leading  zeros  should  be  preserved  accordingly. 
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another  example,  consider  line  18  in  the  same  program.  It  assigns  a  value  to  Pkj  and 
the  corresponding  triple  is  This  says  that  the  value  of  akj  (as  defined  above)  is 

currently  stored  in  local  variable  Pkj  at  those  processors  with  z^2m,M)  k  and  Z[m.,2m)  —  j  i 
regardless  of  the  value  of  the  k  field. 
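The field notation can be illustrated with a short Python sketch. The helper names split_id, join_id and matches are hypothetical and chosen only for this illustration; the sketch assumes M = 3m and treats a margin triple as a pattern in which "*" is a wildcard and any other entry is a literal field value.

```python
def split_id(z, m):
    """Split a 3m-bit processor id into its (i, j, k) fields."""
    mask = (1 << m) - 1
    k = z & mask                 # z_[0,m)
    j = (z >> m) & mask          # z_[m,2m)
    i = (z >> (2 * m)) & mask    # z_[2m,M)
    return i, j, k


def join_id(i, j, k, m):
    """Concatenate three m-bit fields, ie. z = i o j o k."""
    return (i << (2 * m)) | (j << m) | k


def matches(z, pattern, m):
    """True if processor z fits a pattern such as ("*", 0, 0)."""
    return all(want == "*" or want == got
               for want, got in zip(pattern, split_id(z, m)))


# With m = 2 (n = 4, N = 64) the processors of the form (i,0,0) are:
holders = [z for z in range(64) if matches(z, ("*", 0, 0), 2)]
```

Here holders lists the four processors that would hold the input Pi values under the placement convention described in this section.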

The input to DTEP consists of n Pi values, n^2 Pij values and n^3 Pijk values. Unless
otherwise specified, these will be assumed to reside in processors (i,0,0), (i,j,0) and
(i,j,k) respectively, at the start of execution. The output is given by the final values of Pi
in processors (i,0,0). We have assumed that any processor can terminate the execution
of the entire machine, which eliminates the need to broadcast a termination flag in every
iteration of the loop. Even if all processors must halt independently, the cost of this
broadcast can be hidden from the inner loop analysis by employing a termination bit in
every message. The idea is that every time a processor receives a message it will check
the termination bit. If it is set, that processor broadcasts termination to its neighbors and
then halts.
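As a rough illustration of how quickly such a termination signal spreads, the following Python sketch floods a halt notice through an M-dimensional hypercube. It is a simplification under stated assumptions: the function name flood_rounds is hypothetical, and each halting processor is assumed to notify all M neighbors in a dedicated round rather than piggybacking the bit on regular messages.

```python
def flood_rounds(M, start):
    """Rounds until every node of an M-dimensional hypercube has halted."""
    halted = {start}
    rounds = 0
    while len(halted) < (1 << M):
        # Every processor that has already halted has notified its M neighbors.
        halted |= {z ^ (1 << d) for z in halted for d in range(M)}
        rounds += 1
    return rounds
```

Under these assumptions the signal reaches every processor after M = log N rounds, since each round extends the reached set by one unit of Hamming distance.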

4.1  Hypercube 

The program below implements the DTEP algorithm and performs inter-processor
communication solely through calls to Replicate and Collect. By simply plugging in the routines
developed in the previous section, one obtains O(log n) time implementations of the DTEP
inner loop for all four of the networks we are studying. The program works as follows.
Lines 1 and 2 copy the input Pi and Pij values to processors (i,*,*) and (i,j,*)
respectively. Line 4 initializes Pj, Pk and saves the current set of Pi values in previous. Lines 5
and 6 redistribute Pj and Pk so that they depend upon the appropriate fields of bits in the
processor id. Lines 7 and 8 attempt to derive more Pi, Pij values. Lines 9 and 10 collect
and distribute the updated set of Pi values. Lines 11 to 15 check to see whether any new
Pi has been derived. Lines 16 and 17 collect and distribute the new set of Pij values. Lines
18 to 20 produce appropriately transposed versions of [Pij] in the Pik and Pkj variables.
Lines 21 to 23 complete the matrix multiplication; line 21 performs the "multiplications"
while lines 22 and 23 perform the "additions".
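The walk-through above can be mirrored by a sequential Python sketch that ignores the network entirely and only reproduces the logical effect of each group of lines. The function name dtep_reference and the rule encoding are hypothetical. One assumption is made explicit in the code: after the squaring step the old Pij entries are OR-ed back in, which coincides with the plain product whenever the matrix is reflexive.

```python
from itertools import product


def dtep_reference(n, axioms, unary, binary):
    """Sequential stand-in for the inner loop; returns the indices i with Pi true.

    axioms: indices i with p_i given as true
    unary:  pairs (i, j) for rules p_i <- p_j
    binary: triples (i, j, k) for rules p_i <- p_j p_k
    """
    axioms, unary, binary = set(axioms), set(unary), set(binary)
    P = [i in axioms for i in range(n)]
    Pij = [[(i, j) in unary for j in range(n)] for i in range(n)]
    while True:
        previous = P[:]                               # line 4
        for i, j in product(range(n), repeat=2):      # lines 7, 9, 10
            if previous[j] and (Pij[i][j] or any(
                    previous[k] and (i, j, k) in binary for k in range(n))):
                P[i] = True
        for i, j, k in binary:                        # lines 8, 16, 17
            if previous[k]:
                Pij[i][j] = True
        if P == previous:                             # lines 11 to 15
            return {i for i in range(n) if P[i]}
        old = [row[:] for row in Pij]                 # lines 18 to 23
        for i, j in product(range(n), repeat=2):
            # Assumption: old entries are kept in addition to the product.
            Pij[i][j] = old[i][j] or any(
                old[i][k] and old[k][j] for k in range(n))
```

For instance, with axiom p0 and rules p1 <- p0, p2 <- p1, p3 <- p1 p2 and p4 <- p3 p4, the sketch derives p0 to p3 and leaves p4 false.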

Running on a hypercube the complexity of this program is given by the entries in the
last two columns of Table 2. We can reduce the step count to 9 log n by using the version
of Collect with O(N log N) communication cost described in Section 3.1, which allows lines
10, 17 and 23 to be eliminated.
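The two step counts can be checked with a few lines of Python. This is only a bookkeeping sketch: it assumes that each call to Replicate or Collect costs one step per bit of the field it spans (the width argument of the call), measured in units of m = log n. The names COST and steps are hypothetical.

```python
# Field width, in units of m = log n, spanned by each communication line
# of the Section 4.1 loop body. Assumption: one step per bit of width.
COST = {5: 1, 6: 1, 9: 2, 10: 2, 13: 1, 16: 1, 17: 1,
        19: 1, 20: 1, 22: 1, 23: 1}


def steps(cost, removed=()):
    """Total step count, in multiples of log n, after dropping some lines."""
    return sum(width for line, width in cost.items() if line not in removed)


full = steps(COST)                           # 13, ie. 13 log n
minimum = steps(COST, removed=(10, 17, 23))  # 9, ie. 9 log n
```

Under this cost model the totals agree with the 13 log n and 9 log n figures quoted for the hypercube.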

In a MIMD environment and with a larger hypercube, there is another level of
parallelism which can be exploited: independent computations can be performed at the same
time on separate subcubes of size N. The loop can be restructured so that it runs in 3 log n
steps on a hypercube with 4N processors, ie. four subcubes of size N. Assuming that lines
7 and 8 are moved to the top, the first log n steps make use of two subcubes to perform the
first half of the computation of line 9 and the entire computation of line 16 simultaneously.
The other two subcubes are idle during this period of time. During the second log n steps,
three subcubes are used to complete the computation of line 9 while performing lines 19
and 20. All four subcubes are used during the third and final set of log n steps in order to
execute lines 5, 6, 13 and 22 simultaneously. Notice that this MIMD algorithm would be
easy to implement since each of the four subcubes operates in a SIMD manner.

    Network            Processors    Minimum     Steps       Communication
    hypercube          N             9 log n     13 log n    12N + O(n^2)
    butterfly          N log N       12 log n    16 log n    2N log N + O(N)
    perfect shuffle    N             13 log n    23 log n    N log N + O(N)
    3D mesh of trees   4N - 3n^2     17 log n    19 log n    10N + O(n^2 log n)

Table 2: Analysis of DTEP inner loop implementations.


procedure DTEP
(1)   Replicate(&Pi, 0, 2m, 0)                          (i,*,*)
(2)   Replicate(&Pij, 0, m, 0)                          (i,j,*)
(3)   loop
(4)       previous, Pj, Pk ← Pi                         (i,*,*),(i,*,*),(i,*,*)
(5)       Replicate(&Pj, 2m, m, z_[m,2m))               (*,j,*)
(6)       Replicate(&Pk, 2m, m, z_[0,m))                (*,*,k)
(7)       Pi ←∨ (Pj ∧ Pk ∧ Pijk) ∨ (Pj ∧ Pij)           (i,j,k)
(8)       Pij ←∨ Pk ∧ Pijk                              (i,j,k)
(9)       Collect(&Pi, 0, 2m)                           (i,0,0)
(10)      Replicate(&Pi, 0, 2m, 0)                      (i,*,*)
(11)      if z_[0,2m) = 0 then
(12)          change ← Pi ≠ previous                    (i,0,0)
(13)          Collect(&change, 2m, m)                   (0,0,0)
(14)          exit when ¬change at (0,0,0)
(15)      end if
(16)      Collect(&Pij, 0, m)                           (i,j,0)
(17)      Replicate(&Pij, 0, m, 0)                      (i,j,*)
(18)      Pik, Pkj ← Pij                                (i,k,*),(k,j,*)
(19)      Replicate(&Pik, m, m, z_[0,m))                (i,*,k)
(20)      Replicate(&Pkj, 2m, m, z_[0,m))               (*,j,k)
(21)      Pij ← Pik ∧ Pkj                               (i,j,k)
(22)      Collect(&Pij, 0, m)                           (i,j,0)
(23)      Replicate(&Pij, 0, m, 0)                      (i,j,*)
      end loop
end DTEP
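To make the margin triples more concrete, the following Python sketch models only the net data movement of a Replicate call, not its log-step implementation. The semantics used here are an assumption inferred from the triples in the program: a call with arguments (start, width, source) makes every processor read the value held by the processor that agrees with it outside the bit field [start, start + width) and whose field equals the source expression. The function name replicate and its signature are hypothetical.

```python
def replicate(X, start, width, src_of):
    """Net effect of a Replicate over the bit field [start, start + width).

    X maps processor id -> value. Processor z receives the value held by the
    processor equal to z outside the field, with the field set to src_of(z).
    """
    mask = ((1 << width) - 1) << start
    return {z: X[(z & ~mask) | (src_of(z) << start)] for z in X}


# Line 5 with m = 1 (n = 2, N = 8): Pj starts out distributed as (i,*,*).
m = 1
p = [True, False]
Pj = {z: p[z >> (2 * m)] for z in range(8)}
Pj = replicate(Pj, 2 * m, m, lambda z: (z >> m) & 1)
```

After the call the value at processor (i,j,k) depends on the j field alone, which is the distribution (*,j,*) quoted for line 5.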


4.2  Butterfly 

As shown in Table 2, the butterfly implementation uses N log N processors. For
convenience, we have assumed that the input Pi values are to be found in rank 2m and the input
Pij values are in rank m. The output Pi values are in rank 0. In order to minimize
communication complexity it is necessary to eliminate as many calls to Adjust as possible since
it uses N log N messages. We were able to get rid of all but two, so the communication
cost is as shown in Table 2. As it stands the algorithm has step complexity 17 log n. This
can be reduced to 16 log n by concatenating Pj and Pk in order to perform lines 7 and 8
with a single call to Replicate'.

For  the  minimum  step  count  version,  the  idea  that  we  used  for  the  hypercube  applies 
once  again.  In  this  case  lines  12,  19  and  27  can  be  eliminated  at  the  expense  of  a  constant 
factor  increase  in  communication  cost.  However,  this  cannot  be  done  without  further 
restructuring  since  the  rank  in  which  the  sets  of  values  in  question  are  left  is  affected.  It  is 
not  difficult  to  perform  this  restructuring  in  order  to  obtain  a  step  count  of  12  log  n.  This 
is  3  log  n  higher  than  for  the  hypercube  because  three  adjustments  are  performed. 

Under a limited MIMD model in which individual ranks still operate in a SIMD fashion,
the butterfly with N log N processors can achieve a step count of 5 log n, as stated in
Table 3. Calls to Replicate and Collect which make use of disjoint rank intervals may be
performed simultaneously, while those for which the intervals overlap can be pipelined.


procedure DTEP
(1)   Replicate''(&Pi, 0, 2m, 0)                        0: (i,*,*)
(2)   Replicate''(&Pij, 0, m, 0)                        0: (i,j,*)
(3)   loop
(4)       0: previous, Pj, Pk ← Pi                      0: (i,*,*),(i,*,*),(i,*,*)
(5)       Replicate''(&Pj, 2m, m, z_[m,2m))             2m: (*,j,*)
(6)       Replicate''(&Pk, 2m, m, z_[0,m))              2m: (*,*,k)
(7)       Replicate'(&Pj, 2m, m, 0)                     0: (*,j,*)
(8)       Replicate'(&Pk, 2m, m, 0)                     0: (*,*,k)
(9)       0: Pi ←∨ (Pj ∧ Pk ∧ Pijk) ∨ (Pj ∧ Pij)        0: (i,j,k)
(10)      0: Pij ←∨ Pk ∧ Pijk                           0: (i,j,k)
(11)      Collect'(&Pi, 0, 2m)                          2m: (i,0,0)
(12)      Replicate''(&Pi, 0, 2m, 0)                    0: (i,*,*)
(13)      if z_[0,2m) = 0 then
(14)          0: change ← Pi ≠ previous                 0: (i,0,0)
(15)          Collect''(&change, 2m, m)                 2m: (0,0,0)
(16)          exit when ¬change at 2m: (0,0,0)
(17)      end if
(18)      Collect'(&Pij, 0, m)                          m: (i,j,0)
(19)      Replicate''(&Pij, 0, m, 0)                    0: (i,j,*)
(20)      Adjust(&Pij, 0, 2m)                           2m: (i,j,*)
(21)      2m: Pik, Pkj ← Pij                            2m: (i,k,*),(k,j,*)
(22)      Replicate''(&Pik, m, m, z_[0,m))              m: (i,*,k)
(23)      Adjust(&Pik, m, 2m)                           0: (i,*,k)
(24)      Replicate'(&Pkj, 2m, m, z_[0,m))              0: (*,j,k)
(25)      0: Pij ← Pik ∧ Pkj                            0: (i,j,k)
(26)      Collect'(&Pij, 0, m)                          m: (i,j,0)
(27)      Replicate''(&Pij, 0, m, 0)                    0: (i,j,*)
(28)  end loop
end DTEP

    Network            Processors    Minimum
    hypercube          4N            3 log n
    butterfly          N log N       5 log n
    3D mesh of trees   4N - 3n^2     8 log n

Table 3: Minimum step counts for MIMD implementations.


4.3  Perfect  Shuffle 

For the perfect shuffle implementation it is convenient to assume that Pi is given in (0,0,i)
and Pij in (0,i,j). The output value of Pi is still to be found in processor (i,0,0), however.
We were able to eliminate all but one of the calls to Cycle so the communication cost
is as shown in Table 2. There is an interesting trick which can be used to decrease the
number of steps per iteration by 2 log n. As observed by Dekel et al., the perfect shuffle
can compute the transpose of the product of two matrices more rapidly than the actual
product [DNS81]. This fact may be used to essentially get rid of the calls to Replicate on
lines 20 and 21. In order to make use of the transpose of [Pij]^2 it is necessary to unroll the
loop body by a factor of two and maintain some extra local variables. Unfortunately, there
is now a data alignment problem between consecutive iterations. This could be solved with
a call to Cycle or by unrolling the loop body by a further factor of three. Of course, the
results in Table 2 reflect the latter choice.


procedure DTEP
(1)   Replicate'(&Pi, 2m, 0)                            (i,*,*)
(2)   Replicate'(&Pij, m, 0)                            (i,j,*)
(3)   loop
(4)       previous, Pj ← Pi                             (i,*,*),(i,*,*)
(5)       Replicate''(&Pj, m, z_[0,m))                  (*,j,*)
(6)       Pk ← Pj                                       (*,k,*)
(7)       Replicate''(&Pk, m, 0)                        (*,*,k)
(8)       Pi ←∨ (Pj ∧ Pk ∧ Pijk) ∨ (Pj ∧ Pij)           (i,j,k)
(9)       Pij ←∨ Pk ∧ Pijk                              (i,j,k)
(10)      Collect''(&Pi, 2m)                            (0,0,i)
(11)      Replicate'(&Pi, 2m, 0)                        (i,*,*)
(12)      if z_[0,2m) = 0 then
(13)          change ← Pi ≠ previous                    (i,0,0)
(14)          Collect'(&change, m)                      (0,0,0)
(15)          exit when ¬change at (0,0,0)
(16)      end if
(17)      Collect''(&Pij, m)                            (0,i,j)
(18)      Replicate'(&Pij, m, 0)                        (i,j,*)
(19)      Pkj ← Pij                                     (k,j,*)
(20)      Replicate'(&Pkj, m, z_[0,m))                  (j,k,*)
(21)      Replicate''(&Pkj, m, 0)                       (*,j,k)
(22)      Pik ← Pkj                                     (*,k,i)
(23)      Cycle(&Pik, 2m)                               (i,*,k)
(24)      Pij ← Pik ∧ Pkj                               (i,j,k)
(25)      Collect''(&Pij, m)                            (0,i,j)
(26)      Replicate'(&Pij, m, 0)                        (i,j,*)
(27)  end loop
end DTEP


Since there are quite a few minor differences between it and the preceding program,
the minimum step count version is presented in its entirety. The input/output variables
are the same except that Pi begins in processor (i,0,0). As above, it is possible to save
2 log n steps by loop unrolling and computing the transpose of the square; in this case it
is the work performed by lines 44 and 49 which becomes unnecessary.

At this point, one might hope to obtain a MIMD version with an even lower step count,
as we did for the hypercube. Unfortunately, the perfect shuffle organization does not lend
itself well to partitioning schemes; a significant amount of overhead seems to be necessary
to maintain the partition. In the present case, it appears that the extra steps required to
handle this overhead would entirely offset any potential decrease, so this strategy is not
worthwhile.


procedure DTEP
(28)  Replicate''(&Pi, 2m, 0)                           (*,*,i)
(29)  Replicate'(&Pij, m, 0)                            (i,j,*)
(30)  loop
(31)      previous, Pj, Pk ← Pi
(32)      Cycle(&Pi, 2m)
(33)      Cycle(&Pj, m)
(34)      Pi ←∨ (Pj ∧ Pk ∧ Pijk) ∨ (Pj ∧ Pij)
(35)      Pij ←∨ Pk ∧ Pijk
(36)      Collect''(&Pi, 2m)                            (*,*,i)
(37)      if z_[m,M) = 0 then
(38)          change ← Pi ≠ previous                    (0,0,i)
(39)          Collect''(&change, m)                     (0,0,0)
(40)          exit when ¬change at (0,0,0)
(41)      end if
(42)      Collect''(&Pij, m)                            (*,i,j)
(43)      Pkj ← Pij                                     (*,k,j)
(44)      Replicate''(&Pkj, m, z_[2m,M))
(45)      Pik ← Pkj                                     (*,k,i)
(46)      Cycle(&Pik, 2m)
(47)      Pij ← Pik ∧ Pkj
(48)      Collect''(&Pij, m)
(49)      Cycle(&Pij, m)
(50)  end loop
end DTEP


4.4  Multi-Dimensional  Mesh  of  Trees 

Our  multi-dimensional  mesh  of  trees  implementation  is  only  a  slightly  modified  version  of 
the  program  given  in  Section  4.1.  By  eliminating  three  redundant  PassUp,  PassDown  pairs 
we  obtain  the  step  count  and  communication  cost  stated  in  Table  2.  For  example,  lines  9 
and  10  from  Section  4.1  get  translated  into  the  block  of  code  given  below. 


(1)  Collect'(&Pi, 0, 0)
(2)  PassDown(&Pi, 0, 0)
(3)  Collect'(&Pi, 1, 0)
(4)  Replicate'(&Pi, 1, 0, 0)
(5)  PassUp(&Pi, 0, 0, 0)
(6)  Replicate'(&Pi, 0, 0, 0)


The  minimum  step  count  version  can  be  achieved  by  using  more  messages  in  lines  2 
to  4  so  that  5  and  6  can  be  eliminated.  This  does  not  result  in  an  asymptotic  increase  in 
message  complexity;  it  just  increases  the  coefficient  on  the  leading  term  from  10  to  12. 

Using the techniques we have discussed for the other networks, it is easy to derive
a MIMD implementation of the DTEP inner loop which runs in 8 log n steps without
increasing the number of processors.
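The processor count of the three-dimensional mesh of trees can be checked with a short calculation. The sketch below assumes the usual construction, in which the n^3 leaves form a cube and every line of n leaves along each dimension is the leaf set of its own complete binary tree; the function name is hypothetical.

```python
def mesh_of_trees_processors(n, dims=3):
    """Processors in a dims-dimensional mesh of trees with n leaves per side."""
    leaves = n ** dims
    trees = dims * n ** (dims - 1)   # one tree per line of leaves, per dimension
    internal_per_tree = n - 1        # complete binary tree over n leaves
    return leaves + trees * internal_per_tree
```

For dims = 3 this gives n^3 + 3n^2(n - 1) = 4N - 3n^2 with N = n^3, for example 208 processors when n = 4.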

5  Conclusions 

Tables 2 and 3 summarize the results of our analysis. The communication cost of our
implementations could be further reduced by only attempting to derive Pi at those processors
where it is false. Note that this requires data dependent conditions for enabling/disabling
processors.

It is possible to use bit compression techniques to reduce the processor requirements of
every one of our implementations by a factor of log n [Pl87]. For all of the networks we have
considered except the perfect shuffle, this can be done without increasing the coefficient
on the leading term of the running time. For the same set of networks, an extension of an
idea due to Dekel & Sahni [DS83] allows the processor requirements to be lowered by an
additional factor of log n. However, this reduction increases the running time by a constant
factor and requires a MIMD model for the butterfly and multi-dimensional mesh of trees
[Pl87].




A  List  of  Symbols 


    &              address operator
    *              indirection operator; also used as wildcard
    ∨              logical OR operator
    ∧              logical AND operator
    ¬              logical negation operator
    ≡              logical equivalence operator
    =              equality operator
    ←              local assignment operator
    x ←op y        x ← x op y
    ⇐              inter-processor assignment operator
    x ⇐op y        x ⇐ x op y
    (c) ? x : y    conditional expression: if c then x else y
    [a, b]         (a ≤ b) ? {a, a + 1, ..., b} : {}
    [a, b)         (a < b) ? {a, a + 1, ..., b - 1} : {}
    (a, b]         (a < b) ? {a + 1, a + 2, ..., b} : {}
    (a, b)         (a + 1 < b) ? {a + 1, a + 2, ..., b - 1} : {}
    x_i            ith bit of x (low order bit is x_0)
    x_[a,b]        (a ≤ b) ? (x_b x_(b-1) ... x_a)_2 : 0
    ∘              bit string concatenation, eg. 1_[0,2) ∘ 12_[1,4) = 01_2 ∘ 110_2 = 01110_2
    ≪              shift left operator, eg. 101_2 ≪ 3 = 101000_2
    ≫              shift right operator, eg. 101_2 ≫ 1 = 10_2
    ⊕              bitwise XOR operator
    x at z         x at processor z
    [a_ij]         the matrix of a_ij's
    log x          log_2 x
    Θ(f(n))        O(f(n)) and Ω(f(n))
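The bit field and concatenation notation can be tried out directly in Python. The helper names bits and concat are hypothetical; bits follows the half-open convention x_[a,b) used in the programs.

```python
def bits(x, a, b):
    """x_[a,b): bits a to b-1 of x as an integer, or 0 for an empty range."""
    return (x >> a) & ((1 << (b - a)) - 1) if a < b else 0


def concat(hi, lo, lo_width):
    """Bit string concatenation hi o lo, where lo occupies lo_width bits."""
    return (hi << lo_width) | lo


# The concatenation example from the list: 1_[0,2) o 12_[1,4)
example = concat(bits(1, 0, 2), bits(12, 1, 4), 3)
```

The value of example is 14, whose five-bit representation is 01110, in agreement with the entry for the concatenation operator; the shift examples correspond to 0b101 << 3 and 0b101 >> 1.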

For each network family we require some additional notation for specifying processor
ids and adjacencies. For the hypercube we have

    z              M bit processor id
                   x at z ⊕ 2^i

Each processor in a butterfly network has a processor id consisting of two components:
rank and z. The following notation is used

    rank           ⌈log M⌉ high bits of id; specifies rank
    z              M low bits of id; specifies position within rank
    x_(+1)         x at (rank + 1, z)
    x_(-1)         x at (rank - 1, z)
                   x at (rank + 1, z ⊕ 2^rank)
                   x at (rank - 1, z ⊕ 2^(rank-1))

Also, if a statement is labelled with a number r followed by a colon then it is executed only
at those processors with rank = r. A butterfly with N processors per rank has log N ranks;
we identify the top and bottom ranks. All arithmetic involving ranks should be assumed
to be performed modulo log N (eg. x_(-1) at (0, z) is the same as x at (log N - 1, z)).
For the perfect shuffle we specify ids and adjacencies in the following manner

    z              M bit processor id
    x_exchange     x at z ⊕ 1
    ↶              unary rotate left operator, eg. ↶10111_2 = 01111_2
    ↷              unary rotate right operator, eg. ↷01011_2 = 10101_2
    x_shuffle      x at ↶z
    x_unshuffle    x at ↷z
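The two rotation operators act on M bit ids and can be sketched in Python as follows. The function names are hypothetical, and the width M must be passed explicitly because Python integers have no fixed width.

```python
def rotate_left(z, M):
    """Rotate the M-bit value z left by one position."""
    top = (z >> (M - 1)) & 1
    return ((z << 1) & ((1 << M) - 1)) | top


def rotate_right(z, M):
    """Rotate the M-bit value z right by one position."""
    return (z >> 1) | ((z & 1) << (M - 1))


def exchange(z):
    """Flip the low order bit, ie. z XOR 1."""
    return z ^ 1
```

With M = 5 these reproduce the two examples in the list: rotating 10111 left gives 01111, and rotating 01011 right gives 10101.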

For the multi-dimensional mesh of trees family, we assign each processor an id which
is most conveniently thought of as a triple (dim, height, z). The height of a processor is its
distance from the nearest root. Assume N, n, k are as defined in Table 1 and M = log N,
m = log n. The dim field is irrelevant for the leaf processors (those with height = m)
since they each belong to every dimension. We use the following notation

    dim             ⌈log k⌉ high bits of id; belongs to [0, k)
    height          ⌈log(m + 1)⌉ middle bits of id; belongs to [0, m]
    z               M low bits of processor id
    z^d             z_[md, md+m), d ∈ [0, k)
    subst(z,x,d)    z_[md+m, M) ∘ x_[0,m) ∘ z_[0,md), d ∈ [0, k)
    x_parent        x at (dim, height - 1, subst(z, z^dim ≫ 1, dim))
    x_left child    x at (dim, height + 1, subst(z, z^dim ≪ 1, dim))
    x_right child   x at (dim, height + 1, subst(z, (z^dim ≪ 1) + 1, dim))
    x_sibling       x at (dim, height, subst(z, z^dim ⊕ 1, dim))

Note that a reference to the parent of a leaf processor is not well defined unless it is
accompanied by a dimension. In our programs the intended dimension of the parent of a
leaf will not be given explicitly but should be obvious. If a statement is labelled with a
pair d, h followed by a colon then it is only executed at those processors with dim = d and
height = h.